
    CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings

    Following the success of the 1st, 2nd, 3rd, 4th, and 5th CHiME challenges, we organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6). The new challenge revisits the previous CHiME-5 challenge and further considers the problem of distant multi-microphone conversational speech diarization and recognition in everyday home environments. The speech material is the same as the previous CHiME-5 recordings except for accurate array synchronization. The material was elicited using a dinner party scenario, with efforts taken to capture data that is representative of natural conversational speech. This paper provides a baseline description of the CHiME-6 challenge for both segmented multispeaker speech recognition (Track 1) and unsegmented multispeaker speech recognition (Track 2). Of note, Track 2 is the first challenge activity in the community to tackle an unsegmented multispeaker speech recognition scenario with a complete set of reproducible open-source baselines providing speech enhancement, speaker diarization, and speech recognition modules.
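    Conceptually, the Track 2 baseline chains its three modules over one unsegmented multi-microphone recording. The sketch below is only an illustration of that flow: the stage functions are hypothetical stubs standing in for the released enhancement, diarization, and ASR recipes, not the challenge's actual API.

```python
# Minimal sketch of the Track 2 flow: enhancement -> diarization -> recognition
# on one unsegmented dinner-party session. All three stage functions are toy
# placeholders, not the CHiME-6 baseline implementations.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Segment:
    speaker: str        # diarization label, e.g. "P05"
    start: float        # segment start time in seconds
    end: float          # segment end time in seconds
    text: str = ""      # hypothesis filled in by the recognizer

def enhance(multichannel_audio: List[List[float]]) -> List[float]:
    """Stub for the speech-enhancement front end (dereverberation, beamforming)."""
    return multichannel_audio[0]            # placeholder: pass through one channel

def diarize(audio: List[float]) -> List[Tuple[str, float, float]]:
    """Stub for speaker diarization: returns (speaker, start, end) tuples."""
    return [("P05", 0.0, 4.2), ("P06", 3.8, 7.5)]   # overlap between talkers is allowed

def recognize(audio: List[float], start: float, end: float) -> str:
    """Stub for the ASR module applied to one diarized segment."""
    return "<hypothesis>"

def track2_pipeline(multichannel_audio: List[List[float]]) -> List[Segment]:
    enhanced = enhance(multichannel_audio)
    segments = [Segment(s, t0, t1) for s, t0, t1 in diarize(enhanced)]
    for seg in segments:
        seg.text = recognize(enhanced, seg.start, seg.end)
    return segments

if __name__ == "__main__":
    for seg in track2_pipeline([[0.0] * 16000]):
        print(f"{seg.speaker} [{seg.start:.1f}-{seg.end:.1f}s]: {seg.text}")
```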

    A Synergistic Combination of Signal Processing and Deep Learning for Robust Speech Recognition

    When speech is captured with a distant microphone, it includes distortions caused by noise, reverberation, and overlapping speakers. Far-field speech processing systems need to be robust to those distortions to function in real-world applications and hence have front-end components to handle them. Ideally, these systems must be equipped to answer “who is speaking, what, when, and from where?” to be complete. The typical front-end components used currently have two issues: (1) they are optimized based on signal reconstruction objectives, and (2) they do not try to explicitly localize the direction of the speakers. This makes the overall speech processing system sub-optimal, as the front end is optimized independently of the downstream task, and unexplainable, as it does not address “where?”. In the first part of the thesis, carefully designed multichannel speech enhancement or speech separation subnetworks are encompassed inside a sequence-to-sequence automatic speech recognition (ASR) system. Although the entire network is trained only based on ASR error minimization criteria, the intermediate outputs after each subnetwork can be interpreted perceptually. The proposed systems optimized based on the ASR objective improve recognition performance and give comparable results on various signal-level metrics compared to traditional cascaded systems. The second part of the thesis introduces novel supervised learning methods for multi-source localization. The significance of these methods as an effective front end for multi-talker speech recognition, with the potential to reduce word error rates by about a factor of two, is established. Finally, combining the strengths of joint optimization and location-driven systems, the thesis introduces a new paradigm for handling far-field multi-speaker data in an end-to-end neural network manner, called directional ASR. In directional ASR, the azimuth angle of the sources with respect to the microphone array is defined as a latent variable. This angle controls the quality of separation, which in turn determines the ASR performance. All three functionalities of directional ASR (localization, separation, and recognition) are connected as a single differentiable neural network and trained solely based on ASR error minimization objectives. Directional ASR outperforms a strong baseline in both separation quality and ASR performance.
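    To make the directional ASR idea concrete, the PyTorch sketch below wires a localization subnetwork, an azimuth-conditioned separation stage, and a recognizer into one differentiable chain supervised only by an ASR loss. It is a conceptual toy under assumed layer sizes and a CTC criterion, not the thesis's architecture; in particular, conditioning the separator by appending the predicted angle as a feature is an illustrative simplification.

```python
# Toy directional ASR chain: localize -> separate -> recognize, trained end to end
# with only an ASR (CTC) loss, so the azimuth remains a latent variable.
import torch
import torch.nn as nn

class DirectionalASR(nn.Module):
    def __init__(self, n_mics: int = 4, feat_dim: int = 80, vocab: int = 100):
        super().__init__()
        # Localization: predict one azimuth value per frame from the array features.
        self.localizer = nn.Sequential(
            nn.Linear(n_mics * feat_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )
        # Separation: conditioned on the predicted azimuth (appended as a feature).
        self.separator = nn.Sequential(
            nn.Linear(n_mics * feat_dim + 1, 256), nn.ReLU(), nn.Linear(256, feat_dim)
        )
        # Recognition: a small encoder plus an output layer over the vocabulary.
        self.encoder = nn.GRU(feat_dim, 256, batch_first=True)
        self.classifier = nn.Linear(256, vocab)

    def forward(self, array_feats: torch.Tensor) -> torch.Tensor:
        # array_feats: (batch, time, n_mics * feat_dim)
        azimuth = self.localizer(array_feats)                       # latent angle, (B, T, 1)
        separated = self.separator(torch.cat([array_feats, azimuth], dim=-1))
        encoded, _ = self.encoder(separated)
        return self.classifier(encoded)                             # token logits, (B, T, vocab)

# Only the ASR criterion supervises the whole chain; the azimuth never gets a
# ground-truth target, so its estimate is shaped purely by recognition error.
model = DirectionalASR()
feats = torch.randn(2, 50, 4 * 80)
log_probs = model(feats).log_softmax(dim=-1).transpose(0, 1)        # (T, B, vocab) for CTC
targets = torch.randint(1, 100, (2, 12))
loss = nn.CTCLoss()(log_probs, targets,
                    torch.full((2,), 50, dtype=torch.long),
                    torch.full((2,), 12, dtype=torch.long))
loss.backward()   # gradients flow through recognizer, separator, and localizer
```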